2024-09-08
EUROSENSE 2024
A Sense of Global Culture
11th Conference on Sensory and Consumer Research
8-11 September 2024
Dublin, Ireland
Hello! 👋
Ruben Rama
Global Sensory and Consumer Insights Data and Knowledge Manager
Step-by-step guide to help sensory and consumer scientists start their journey into Natural Language Processing (NLP) analysis.
Natural Language Processing (NLP) is a field of Artificial Intelligence that makes human language intelligible to machines.
NLP studies the rules and structure of language, and creates intelligent systems capable of:
Part One: Text Mining and Exploratory Analysis
Part Two: Tidy Sentiment Analysis in R
Part Three: Topic Modelling
tidyverse
We will be using tidy principles.
The tidyverse is an opinionated collection of R packages designed for data science.
All packages share an underlying design philosophy, grammar, and data structures.
tidyverse
Tidy data has a specific structure:
tidyverse
We can install the complete tidyverse with:
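The standard installation command, as documented by the tidyverse project, is:

```r
# Install the complete tidyverse collection from CRAN
install.packages("tidyverse")

# Then load it for the session
library(tidyverse)
```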
tidyverse
Pipes: the magrittr pipe %>% (re-exported by the tidyverse) and the base R native pipe |>.
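Both pipes pass the result of the left-hand side as the first argument of the next call. A minimal comparison:

```r
library(dplyr)  # re-exports the magrittr pipe %>%

# magrittr pipe
c(1, 4, 9) %>% sqrt() %>% sum()
# → 6

# base R native pipe (available since R 4.1), no package needed
c(1, 4, 9) |> sqrt() |> sum()
# → 6
```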
Prepare your Questions!
🏁
A Friendly Place
Data can be organized into three categories:
Text mining or text analysis is the process of exploring and analyzing unstructured or semi-structured text data to identify:
From a sensory and consumer perspective, text data can come from lots of different sources:
A simplified description of a typical text-mining workflow can include the following steps:
The original data was sourced from Kaggle's Amazon Alexa Reviews dataset.
The data file is available in the GitHub repository.
We can use the read_csv() function from readr (part of the tidyverse) to load the data:
We can also use the file.choose() function.
That will bring up a file explorer window that will allow us to interactively choose the required file:
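A sketch of both loading options; the filename "amazon_alexa.csv" and the object name `reviews` are assumptions for illustration:

```r
library(readr)

# Load the CSV from a known path (hypothetical filename)
reviews <- read_csv("amazon_alexa.csv")

# Or choose the file interactively via a file explorer window
reviews <- read_csv(file.choose())
```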
# A tibble: 3,150 × 5
stars date product review feedback
<dbl> <chr> <chr> <chr> <dbl>
1 5 31-Jul-18 Charcoal Fabric Love my Echo! 1
2 5 31-Jul-18 Charcoal Fabric Loved it! 1
3 4 31-Jul-18 Walnut Finish Sometimes while playing a game,… 1
4 5 31-Jul-18 Charcoal Fabric I have had a lot of fun with th… 1
5 5 31-Jul-18 Charcoal Fabric Music 1
6 5 31-Jul-18 Heather Gray Fabric I received the echo as a gift. … 1
7 3 31-Jul-18 Sandstone Fabric Without having a cellphone, I c… 1
8 5 31-Jul-18 Charcoal Fabric I think this is the 5th one I'v… 1
9 5 30-Jul-18 Heather Gray Fabric looks great 1
10 5 30-Jul-18 Heather Gray Fabric Love it! I’ve listened to songs… 1
# ℹ 3,140 more rows
We have 3150 reviews.
We may want to remove any duplicated reviews by using the distinct() function from dplyr:
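A minimal sketch, assuming the data is stored in a tibble called `reviews`:

```r
library(dplyr)

# Keep only unique rows; exact duplicate reviews are dropped
reviews <- distinct(reviews)
```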
# A tibble: 2,435 × 5
stars date product review feedback
<dbl> <chr> <chr> <chr> <dbl>
1 5 31-Jul-18 Charcoal Fabric Love my Echo! 1
2 5 31-Jul-18 Charcoal Fabric Loved it! 1
3 4 31-Jul-18 Walnut Finish Sometimes while playing a game,… 1
4 5 31-Jul-18 Charcoal Fabric I have had a lot of fun with th… 1
5 5 31-Jul-18 Charcoal Fabric Music 1
6 5 31-Jul-18 Heather Gray Fabric I received the echo as a gift. … 1
7 3 31-Jul-18 Sandstone Fabric Without having a cellphone, I c… 1
8 5 31-Jul-18 Charcoal Fabric I think this is the 5th one I'v… 1
9 5 30-Jul-18 Heather Gray Fabric looks great 1
10 5 30-Jul-18 Heather Gray Fabric Love it! I’ve listened to songs… 1
# ℹ 2,425 more rows
Briefly, let's focus on just one product.
We can use the filter() and summarise() (or summarize()) functions from dplyr:
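A sketch for a single product, assuming the tibble is called `reviews`; "Charcoal Fabric" is one of the product values in the dataset:

```r
library(dplyr)

reviews %>%
  filter(product == "Charcoal Fabric") %>%  # keep one product
  summarise(stars_mean = mean(stars))       # mean star rating
```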
We may want to group by product and then obtain a summary of the star rating.
We can use group_by() and summarise() (also from dplyr):
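A sketch of the grouped summary, assuming the tibble is called `reviews`:

```r
library(dplyr)

reviews %>%
  group_by(product) %>%                 # one group per product
  summarise(stars_mean = mean(stars))   # mean rating per group
```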
# A tibble: 16 × 2
product stars_mean
<chr> <dbl>
1 Black 4.23
2 Black Dot 4.45
3 Black Plus 4.37
4 Black Show 4.48
5 Black Spot 4.31
6 Charcoal Fabric 4.74
7 Configuration: Fire TV Stick 4.59
8 Heather Gray Fabric 4.70
9 Oak Finish 4.86
10 Sandstone Fabric 4.36
11 Walnut Finish 4.8
12 White 4.14
13 White Dot 4.42
14 White Plus 4.36
15 White Show 4.28
16 White Spot 4.34
We can arrange() the results to show them in descending order (yes! also dplyr):
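The same summary, sorted with desc() inside arrange() (tibble name `reviews` assumed):

```r
library(dplyr)

reviews %>%
  group_by(product) %>%
  summarise(stars_mean = mean(stars)) %>%
  arrange(desc(stars_mean))   # highest-rated products first
```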
# A tibble: 16 × 2
product stars_mean
<chr> <dbl>
1 Oak Finish 4.86
2 Walnut Finish 4.8
3 Charcoal Fabric 4.74
4 Heather Gray Fabric 4.70
5 Configuration: Fire TV Stick 4.59
6 Black Show 4.48
7 Black Dot 4.45
8 White Dot 4.42
9 Black Plus 4.37
10 White Plus 4.36
11 Sandstone Fabric 4.36
12 White Spot 4.34
13 Black Spot 4.31
14 White Show 4.28
15 Black 4.23
16 White 4.14
But we cannot summarise unstructured or categorical data!
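Trying the same summary on the text column illustrates this (tibble name `reviews` assumed): mean() is undefined for character data, so each group returns NA with a warning.

```r
library(dplyr)

reviews %>%
  group_by(product) %>%
  summarise(review_mean = mean(review))  # NA per group: review is text
```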
# A tibble: 16 × 2
product review_mean
<chr> <dbl>
1 Black NA
2 Black Dot NA
3 Black Plus NA
4 Black Show NA
5 Black Spot NA
6 Charcoal Fabric NA
7 Configuration: Fire TV Stick NA
8 Heather Gray Fabric NA
9 Oak Finish NA
10 Sandstone Fabric NA
11 Walnut Finish NA
12 White NA
13 White Dot NA
14 White Plus NA
15 White Show NA
16 White Spot NA
If we want to know the number of reviews per product, we can summarise with n() after grouping by product.
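A sketch, assuming the tibble is called `reviews`:

```r
library(dplyr)

reviews %>%
  group_by(product) %>%
  summarise(number_rows = n())   # n() counts the rows in each group
```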
# A tibble: 16 × 2
product number_rows
<chr> <int>
1 Black 261
2 Black Dot 252
3 Black Plus 270
4 Black Show 260
5 Black Spot 241
6 Charcoal Fabric 219
7 Configuration: Fire TV Stick 342
8 Heather Gray Fabric 79
9 Oak Finish 7
10 Sandstone Fabric 45
11 Walnut Finish 5
12 White 91
13 White Dot 92
14 White Plus 78
15 White Show 85
16 White Spot 108
Alternatively, there is a tidy way to achieve the same result by using count() (thanks, dplyr!):
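count() collapses the group_by() + summarise(n()) pattern into one call (tibble name `reviews` assumed):

```r
library(dplyr)

reviews %>%
  count(product)   # one row per product, with a count column n
```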
# A tibble: 16 × 2
product n
<chr> <int>
1 Black 261
2 Black Dot 252
3 Black Plus 270
4 Black Show 260
5 Black Spot 241
6 Charcoal Fabric 219
7 Configuration: Fire TV Stick 342
8 Heather Gray Fabric 79
9 Oak Finish 7
10 Sandstone Fabric 45
11 Walnut Finish 5
12 White 91
13 White Dot 92
14 White Plus 78
15 White Show 85
16 White Spot 108
Including sort = TRUE arranges the results in descending order.
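The same call with the sort argument (tibble name `reviews` assumed):

```r
library(dplyr)

reviews %>%
  count(product, sort = TRUE)   # most-reviewed products first
```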
# A tibble: 16 × 2
product n
<chr> <int>
1 Configuration: Fire TV Stick 342
2 Black Plus 270
3 Black 261
4 Black Show 260
5 Black Dot 252
6 Black Spot 241
7 Charcoal Fabric 219
8 White Spot 108
9 White Dot 92
10 White 91
11 White Show 85
12 Heather Gray Fabric 79
13 White Plus 78
14 Sandstone Fabric 45
15 Oak Finish 7
16 Walnut Finish 5
There are different methods you can use to condition the text data:
textclean is a package containing several functions that automate the cleaning and normalisation of text data.
The check_text() function performs a thorough analysis of the text, suggesting any pre-processing that ought to be done (be ready for a long output!):
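A sketch of the call, assuming the review text lives in the `review` column of a tibble called `reviews`:

```r
library(textclean)

# Run the full battery of diagnostic checks on the review text
check_text(reviews$review)
```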
===========
CONTRACTION
===========
The following observations contain contractions:
8, 12, 20, 22, 27, 34, 40, 41, 47, 52...[truncated]...
This issue affected the following text:
8: I think this is the 5th one I've purchased. I'm working on getting one in every room of my house. I really like what features they offer specifily playing music on all Echos and controlling the lights throughout my house.
...[truncated]...
12: I love it! Learning knew things with it eveyday! Still figuring out how everything works but so far it's been easy to use and understand. She does make me laugh at times
...[truncated]...
20: I liked the original Echo. This is the same but shorter and with greater fabric/color choices. I miss the volume ring on top, now it's just the plus/minus buttons. Not a big deal but the ring w as comforting. :) Other than that, well I do like the use of a standard USB charger /port instead of the previous round pin. Other than that, I guess it sounds the same, seems to work the same, still answers to Alexa/Echo/Computer. So what's not to like? :)
...[truncated]...
22: We love Alexa! We use her to play music, play radio through iTunes, play podcasts through Anypod, and set reminders. We listen to our flash briefing of news and weather every morning. We rely on our custom lists. We like being able to voice control the volume. We're sure we'll continue to find new uses.Sometimes it's a bit frustrating when Alexa doesn't understand what we're saying.
...[truncated]...
27: I love my Echo. It's easy to operate, loads of fun.It is everything as advertised. I use it mainly to play my favorite tunes and test Alexa's knowledge.
...[truncated]...
34: The speakers sound pretty good for being so small and setup is pretty easy. I bought two and the reason I only rate it a 3 is I have followed the instructions for synching music to both units. I know I've done it correctly but they won't sync. That was my primary motivation for purchasing multiple units.
...[truncated]...
40: This is my first digital assistant so I'm giving this a good review. Speaker is really good for the cheap price on Prime day. Fun to play with and can be used as an alarm clock (That's what I was going to get in the first place, but I ended up with Echo). If you haven't had a go with one then definitely try it!What I like best is the number of other devices that it can connect with. My purchase came with a Smart Plug for $10 which I connect my lamp to. Alexa, turn of the lights!
...[truncated]...
41: My husband likes being able to use it to listen to music. I wish we knew all it's capabilities
...[truncated]...
47: It's like Siri, in fact, Siri answers more accurately then Alexa. I don't see a real need for it in my household, though it was a good bargain on prime day deals.
...[truncated]...
52: I'm still learning how to use it, but so far my Echo is great! The sound is actually much better than I was expecting.
...[truncated]...
*Suggestion: Consider running `replace_contraction`
====
DATE
====
The following observations contain dates:
946
This issue affected the following text:
946: item returned for repair ,receivded item back from repair 07/23/18 . parts missing no power cord included.please advise
*Suggestion: Consider running `replace_date`
=====
DIGIT
=====
The following observations contain digits/numbers:
4, 8, 11, 19, 25, 34, 38, 40, 50, 53...[truncated]...
This issue affected the following text:
4: I have had a lot of fun with this thing. My 4 yr old learns about dinosaurs, i control the lights and play games like categories. Has nice sound when playing music as well.
...[truncated]...
8: I think this is the 5th one I've purchased. I'm working on getting one in every room of my house. I really like what features they offer specifily playing music on all Echos and controlling the lights throughout my house.
...[truncated]...
11: I sent it to my 85 year old Dad, and he talks to it constantly.
...[truncated]...
19: We love the size of the 2nd generation echo. Still needs a little improvement on sound
...[truncated]...
25: I got a second unit for the bedroom, I was expecting the sounds to be improved but I didnt really see a difference at all. Overall, not a big improvement over the 1st generation.
...[truncated]...
34: The speakers sound pretty good for being so small and setup is pretty easy. I bought two and the reason I only rate it a 3 is I have followed the instructions for synching music to both units. I know I've done it correctly but they won't sync. That was my primary motivation for purchasing multiple units.
...[truncated]...
38: Speaker is better than 1st generation Echo
...[truncated]...
40: This is my first digital assistant so I'm giving this a good review. Speaker is really good for the cheap price on Prime day. Fun to play with and can be used as an alarm clock (That's what I was going to get in the first place, but I ended up with Echo). If you haven't had a go with one then definitely try it!What I like best is the number of other devices that it can connect with. My purchase came with a Smart Plug for $10 which I connect my lamp to. Alexa, turn of the lights!
...[truncated]...
50: No different than Apple. To play a specific list of music you must have an Amazon of Spotify “plus/prime/etc” account. So you must pay to play “your” music. 3 stars for that reason. Everything else is 👍🏻 .
...[truncated]...
53: Works as you’d expect and then some. Also good sound quality considering price (70.00 on sale) and features.
...[truncated]...
*Suggestion: Consider using `replace_number`
========
EMOTICON
========
The following observations contain emoticons:
15, 20, 25, 52, 53, 60, 67, 69, 98, 104...[truncated]...
This issue affected the following text:
15: Just what I expected....
...[truncated]...
20: I liked the original Echo. This is the same but shorter and with greater fabric/color choices. I miss the volume ring on top, now it's just the plus/minus buttons. Not a big deal but the ring w as comforting. :) Other than that, well I do like the use of a standard USB charger /port instead of the previous round pin. Other than that, I guess it sounds the same, seems to work the same, still answers to Alexa/Echo/Computer. So what's not to like? :)
...[truncated]...
25: I got a second unit for the bedroom, I was expecting the sounds to be improved but I didnt really see a difference at all. Overall, not a big improvement over the 1st generation.
...[truncated]...
52: I'm still learning how to use it, but so far my Echo is great! The sound is actually much better than I was expecting.
...[truncated]...
53: Works as you’d expect and then some. Also good sound quality considering price (70.00 on sale) and features.
...[truncated]...
60: Love the echo I purchased it for company for my husband he is 83 and Alexa is great all he has to do is say her name and she tells him a joke and plays his favorite songs
...[truncated]...
67: Fast response which was amazing. Clear concise answers and sound quality is fantastic. I am still getting used to Alexia and have not usde Echo to its full extent.
...[truncated]...
69: Does everything as expected and more.
...[truncated]...
98: Love the Echo !!! I love the size, material and speaker quality. I have it hooked up to one light easily and will work on additional lights and thermostat. Next is Echo Dot for bedroom. There is a lot more to do with Echo that you think. Traffic, Weather, Trivia, etc.
...[truncated]...
104: It worked exactly as expected and the speaker has great sound. It is perfect for my classroom!
...[truncated]...
*Suggestion: Consider using `replace_emoticons`
====
HASH
====
The following observations contain Twitter style hash tags (e.g., #rstats):
70, 113, 133, 189, 231, 233, 249, 269, 381, 402...[truncated]...
This issue affected the following text:
70: I love my Echo! Works just like they said it would. I don't have a "smart" home, so I cannot speak about that function, but everything else about it is good.
...[truncated]...
113: i liked the sound . what is troubling is that I paid extra money to have access to a million more songs. Sometimes it doesn't work. Ex. Alexa play Italian songs" .don't have or don" t understand. or play the opera Tosca, response "sorry I don"t have that.
...[truncated]...
133: It's better than the 1st gen in every way except for one. I really miss the ring at the top for volume control. It was quicker and easier to just grab the top and twist without having to look at the buttons and find the "-" one and press it a few times. I also wish the bass was a bit better. All in all, it's a great device and I'm happy with it.
...[truncated]...
189: I don't think the "2nd gen" sounds as good as the 1st. But it does have an aux out... so you could add an external speaker. But if you are going to do that why wouldn't you just get a dot? 2nd issue is (which isn't unique to this unit but I don't understand why I can't override the default that prevents you from playing a blue tooth speaker while playing through a "group". I get there is a delay when using a BT speaker. But if the other units are not where they can be heard then I should be able to play as a group while using the BT speaker.
...[truncated]...
231: I am extremely impressed with this item. Bought it from the "warehouse" or "outlet" with a "minor imperfection. Can't tell it even has one. works great. Didn't come in packaging, but it was sealed up and had no damage and wasn't missing anything. I like the sound quality, I see some knock it. It's not a BOSE but it's more than great for our family. Easy to use, minor learning curve as it learns your voice. It integrates seamlessly with my other amazon services.Can't wait to get for my classroom too! It's a lot of fun even just as a speaker, let alone what I plan to do with it.
...[truncated]...
233: Awesome life changer! Seriously, I am able to start my morning with Alexa, by having her "wake"me up with her alarm and then playing me some music. She has gotten used to my voice, that I can be in another room and she will "listen" to what I say. I love both my echos!!! Don't hesitate, get one and for the price, the speaker is unbelievable. I am buying the cordless holder, so I can take the echo anywhere. Love my purchase and love alexa!!!!
...[truncated]...
249: I bought this to replace a "Dot" in my living room. Speaker is slightly better. It hears me better over the TV. Unfortunately, it doesn't understand or respond to my requests as well as the Dot. I frequently have to request 2 or 3 times to get it to do what I want. The Dot usually does exactly what I want on the first request. I don't consider it an upgrade.
...[truncated]...
269: My husband and I are what I would call "late adopters" when it come to technology, but we decide we would try and Echo to serve primarily as a music source. Wow, were we amazed and the great sound! We've also been having a great time listening to all of our favorite songs buy just asking Alexa. I may even buy one for my elderly Dad - I think he will enjoy having one to listen to music or even place his calls to us!
...[truncated]...
381: Six words, "Alexa, tell me a poop joke."
...[truncated]...
402: We just got this within the last couple of weeks, from what we can tell no issues! My son wanted and "alexa" for his 6th birthday so she could tell him jokes, he could ask her questions and to listen to music she does all of that and more. We all enjoy this one so much that I just bought a second one this week during Prime Deals Day!
...[truncated]...
*Suggestion: Consider using `qdapRegex::ex_tag` (to capture meta-data) and/or `replace_hash`
====
HTML
====
The following observations contain HTML markup:
1371, 1374
This issue affected the following text:
1371: Echo Show - White Great new addition to our Alexa home solution and now I can call back home and video chat directly from my phone. Great way to stay in touch with family.
1374: I like the fact that the messages are visual now as well as audible.I am puzzled because it will light up and make the "notification" sound but when I ask Alexa to read my notifications, she tells me that there are no new notifications. This has happened at least once a day for over a week now.
*Suggestion: Consider running `replace_html`
==========
INCOMPLETE
==========
The following observations contain incomplete sentences (e.g., uses ending punctuation like '...'):
13, 15, 31, 68, 118, 183, 189, 202, 216, 234...[truncated]...
This issue affected the following text:
13: I purchased this for my mother who is having knee problems now, to give her something to do while trying to over come not getting around so fast like she did.She enjoys all the little and big things it can do...Alexa play this song, What time is it and where, and how to cook this and that!
...[truncated]...
15: Just what I expected....
...[truncated]...
31: Still learning all the capabilities...but so far pretty pretty pretty good
...[truncated]...
68: You’re all I need...na na nana!
...[truncated]...
118: It's Alexa.... what else can you say
...[truncated]...
183: Got this as a gift and love it. I never would have bought one for myself, but now that I have it.... Allows me to play music on it from my amozon prime music ; that's worth it in and of itself. Also, gives new's briefs and tells jokes.
...[truncated]...
189: I don't think the "2nd gen" sounds as good as the 1st. But it does have an aux out... so you could add an external speaker. But if you are going to do that why wouldn't you just get a dot? 2nd issue is (which isn't unique to this unit but I don't understand why I can't override the default that prevents you from playing a blue tooth speaker while playing through a "group". I get there is a delay when using a BT speaker. But if the other units are not where they can be heard then I should be able to play as a group while using the BT speaker.
...[truncated]...
202: I owned an echo for overa year but the new lacks the easy way to increase or decrease volume without telling it to increase or decrease volume which is hard to do for my wife since English is her second language she was born in korea. But the sound from the echo is superb. So we’ll keep it..
...[truncated]...
216: Love these, great sound... easy to connect and use...
...[truncated]...
234: I am not super impressed with Alexa. When my Prime lapsed, she wouldn't play anything. She isn't smart enough to differentiate among spotify accounts so we can't use it for that either. She randomly speaks up when nobody is talking to her. Just today I unplugged her...not sure I'll ever use my Alexa again.
...[truncated]...
*Suggestion: Consider using `replace_incomplete`
====
KERN
====
The following observations contain kerning (e.g., 'The B O M B!'):
622, 1477, 1686
This issue affected the following text:
622: If you want to listen to music and have it come through several of the Echo/Dot units simultaneously, YOU MUST PAY A MONTHLY FEE. I thought this was Amazon, not Apple??!! I’ve paid for many of these so I could have one in each room, is that not enough of my money??!!??
1477: IT SEEMS TO BE OK BUT THE INSTRUCTIONS ARE WEAK AND I CAN NOT SEEM TO GET IT TO WORK. I AM GOING TO GET MY TECHY FRIEND TO HELP ME OUT AND I WILL UPDATE YOU LATER
1686: It get on sale after 2 days so ... CHECK EVENTS BEFORE U BUY THESE AMAZON PRODUCTS!
*Suggestion: Consider using `replace_kern`
=============
MISSING VALUE
=============
The following observations contain missing values:
86, 184, 220, 375, 407, 525, 655, 750, 774, 805...[truncated]...
*Suggestion: Consider running `drop_NA`
==========
MISSPELLED
==========
The following observations contain potentially misspelled words:
3, 7, 8, 12, 13, 18, 20, 21, 22, 25...[truncated]...
This issue affected the following text:
3: Somet<<im>>es while playing a game, you can answer a <<que>>stion correctly but <<Alexa>> says you got it wrong a<<nd>> answers the same as you. I like being able to turn lights on a<<nd>> off while away from home.
...[truncated]...
7: <<Wi>>thout having a cellphone, I cannot u<<se>> many of her features. I have an iPad but do not <<se>>e that of any u<<se>>. It IS a great alarm. If u r almost deaf, you can hear her alarm in the bedroom from out in the living room, so that is reason enough to keep her.It is fun to ask ra<<nd>>om <<que>>stions to hear her respon<<se>>. She does not <<se>>em to be very <<smartbon>> politics yet.
...[truncated]...
8: I think this is the 5th one I've <<pur>>ch<<a<<se>>>>d. I'm working on <<ge>>tting one in every room of my hou<<se>>. I really like what features they offer <<specifily>> playing music on all Echos a<<nd>> <<controll>>ing the lights throughout my hou<<se>>.
...[truncated]...
12: I <<lov>>e it! Lear<<ni>>ng knew things with it <<eveyday>>! Still figuring out how everything works but so far it's been easy to u<<se>> a<<nd>> u<<nd>>ersta<<nd>>. She does make me laugh at t<<im>>es
...[truncated]...
13: I <<pur>>ch<<a<<se>>>>d this for my mother who is having knee problems now, to give her something to do while trying to over come not <<ge>>tting arou<<nd>> so fast like she did.She enjoys all the li<<ttl>>e a<<nd>> big things it can do...<<Alexa>> play this song, What t<<im>>e is it a<<nd>> <<whe>>re, a<<nd>> how to cook this a<<nd>> that!
...[truncated]...
18: We have only been using <<Alexa>> for a couple of days a<<nd>> are having a lot of fun with our new toy. It like having a new hou<<se>>hold member! We are trying to learn all the different <<featues>> a<<nd>> benefits that come with it.
...[truncated]...
20: I liked the origi<<na>>l Echo. This is the same but shorter a<<nd>> with greater fabric/c<<olor>> choices. I miss the volume ring on top, now it's just the plus/minus buttons. Not a big deal but the ring w as comforting. :) Other than that, well I do like the u<<se>> of a sta<<nd>>ard USB char<<ge>>r /port instead of the <<pre>>vious rou<<nd>> pin. Other than that, I guess it sou<<nd>>s the same, <<se>>ems to work the same, still answers to <<Alexa>>/Echo/Computer. So what's not to like? :)
...[truncated]...
21: Love the Echo a<<nd>> how good the music sou<<nd>>s playing off it. <<Alexa>> u<<nd>>erst<<a<<nd>>s>> most comm<<a<<nd>>s>> but it is difficult at t<<im>>es for her to fi<<nd>> specific playlists or songs on <<Spotify>>. She is good with Amazon Music but is lacking in other major programs.
...[truncated]...
22: We <<lov>>e <<Alexa>>! We u<<se>> her to play music, play radio through iTunes, play podcasts through <<Anypod>>, a<<nd>> <<se>>t remi<<nd>>ers. We listen to our f<<las>>h briefing of news a<<nd>> weather every mor<<ni>>ng. We rely on our custom lists. We like being able to voice control the volume. We're <<su>>re we'll continue to fi<<nd>> new u<<se>>s.Somet<<im>>es it's a bit frustrating <<whe>>n <<Alexa>> doesn't u<<nd>>ersta<<nd>> what we're saying.
...[truncated]...
25: I got a s<<eco>><<nd>> u<<ni>>t for the bedroom, I was expecting the sou<<nd>>s to be <<im>>proved but I <<didnt>> really <<se>>e a difference at all. Overall, not a big <<im>>provement over the 1st <<ge>>neration.
...[truncated]...
*Suggestion: Consider running `hunspell::hunspell_find` & `hunspell::hunspell_suggest`
========
NO ALPHA
========
The following observations contain elements with no alphabetic (a-z) letters:
61, 1342, 1899, 2079
This issue affected the following text:
61: 😍
1342: 👍🏻
1899: ⭐⭐⭐⭐⭐
2079: 😄😄
*Suggestion: Consider cleaning the raw text or running `filter_row`
==========
NO ENDMARK
==========
The following observations contain elements with missing ending punctuation:
5, 9, 12, 19, 20, 24, 26, 30, 31, 32...[truncated]...
This issue affected the following text:
5: Music
...[truncated]...
9: looks great
...[truncated]...
12: I love it! Learning knew things with it eveyday! Still figuring out how everything works but so far it's been easy to use and understand. She does make me laugh at times
...[truncated]...
19: We love the size of the 2nd generation echo. Still needs a little improvement on sound
...[truncated]...
20: I liked the original Echo. This is the same but shorter and with greater fabric/color choices. I miss the volume ring on top, now it's just the plus/minus buttons. Not a big deal but the ring w as comforting. :) Other than that, well I do like the use of a standard USB charger /port instead of the previous round pin. Other than that, I guess it sounds the same, seems to work the same, still answers to Alexa/Echo/Computer. So what's not to like? :)
...[truncated]...
24: I love it. It plays my sleep sounds immediately when I ask
...[truncated]...
26: Amazing product
...[truncated]...
30: Just like the other one
...[truncated]...
31: Still learning all the capabilities...but so far pretty pretty pretty good
...[truncated]...
32: I like it
...[truncated]...
*Suggestion: Consider cleaning the raw text or running `add_missing_endmark`
====================
NO SPACE AFTER COMMA
====================
The following observations contain commas with no space afterwards:
101, 132, 163, 337, 365, 521, 730, 936, 946, 1060...[truncated]...
This issue affected the following text:
101: Great fun getting to know all the functions of this product. WOW -- family fun and homework help. Talking with other grandchildren,who also have an Echo, is a HUGE bonus. Can't wait to learn more and more and more
...[truncated]...
132: I love it,she is very helpful. I use her for remembering things and sleep. You can ask her just about anything. I have only had her for about a week so still learning her.
...[truncated]...
163: Stopped working after 2 weeks ,didn't follow commands!? Really fun when it was working?
...[truncated]...
337: Like, all types of fun,music, and more
...[truncated]...
365: This small echo dot is amazing the sounds that come out are great.it changes my nest thermostat,and my Phillips hue lights.without leaving my chair.
...[truncated]...
521: This refurbished item was fine,but I wasn't aware that there is a fee for having other echos set up in the rooms. However, it was missing the cordThank you
...[truncated]...
730: It's not perfect, but I really like this little gizmo. i bought it primarily for 2 purposes. First, so I could set wake-up alarms by individual days, and set the wake-up music individually by the day. Second, I wanted to control a bedroom light by voice, so I could shut it off as I was falling asleep, without having to get out of bed to turn a switch. The Echo Spot, together with a smart plug,has been able to accomplish that. A bonus has been getting Alexa to play music from my Amazon Prime playlists.What's not so great is that sometimes Alexa has a really hard time understanding instructions, and repeating and altering the way you say things can get pretty frustrating. Hopefully the AI gets better in the future, along with added functions.
...[truncated]...
936: Got for elderly parents,easy for them to use.just instructions could be more informative
...[truncated]...
946: item returned for repair ,receivded item back from repair 07/23/18 . parts missing no power cord included.please advise
...[truncated]...
1060: just what I expected,already have 2 other shows
...[truncated]...
*Suggestion: Consider running `add_comma_space`
=========
NON ASCII
=========
The following observations contain non-ASCII text:
6, 10, 23, 33, 37, 50, 51, 53, 61, 68...[truncated]...
This issue affected the following text:
6: I received the echo as a gift. I needed another Bluetooth or something to play music easily accessible, and found this smart speaker. Can’t wait to see what else it can do.
...[truncated]...
10: Love it! I’ve listened to songs I haven’t heard since childhood! I get the news, weather, information! It’s great!
...[truncated]...
23: Have only had it set up for a few days. Still adding smart home devices to it. The speaker is great for playing music. I like the size, we have it stationed on the kitchen counter and it’s not intrusive to look at.
...[truncated]...
33: She works well. Needs a learning command for unique, owners and users like. Alexa “learn” Tasha’s birthday. Or Alexa “learn” my definition of Fine. Etc. other than that she is great
...[truncated]...
37: Love my Echo. Still learning all the things it will do. Wasn’t able to follow instructions included in the package, but found a great one on U-Tube.
...[truncated]...
50: No different than Apple. To play a specific list of music you must have an Amazon of Spotify “plus/prime/etc” account. So you must pay to play “your” music. 3 stars for that reason. Everything else is 👍🏻 .
...[truncated]...
51: Excelente, lo unico es que no esta en español.
...[truncated]...
53: Works as you’d expect and then some. Also good sound quality considering price (70.00 on sale) and features.
...[truncated]...
61: 😍
...[truncated]...
68: You’re all I need...na na nana!
...[truncated]...
*Suggestion: Consider running `replace_non_ascii`
==================
NON SPLIT SENTENCE
==================
The following observations contain unsplit sentences (more than one sentence per element):
3, 4, 6, 7, 8, 10, 12, 13, 17, 18...[truncated]...
This issue affected the following text:
3: Sometimes while playing a game, you can answer a question correctly but Alexa says you got it wrong and answers the same as you. I like being able to turn lights on and off while away from home.
...[truncated]...
4: I have had a lot of fun with this thing. My 4 yr old learns about dinosaurs, i control the lights and play games like categories. Has nice sound when playing music as well.
...[truncated]...
6: I received the echo as a gift. I needed another Bluetooth or something to play music easily accessible, and found this smart speaker. Can’t wait to see what else it can do.
...[truncated]...
7: Without having a cellphone, I cannot use many of her features. I have an iPad but do not see that of any use. It IS a great alarm. If u r almost deaf, you can hear her alarm in the bedroom from out in the living room, so that is reason enough to keep her.It is fun to ask random questions to hear her response. She does not seem to be very smartbon politics yet.
...[truncated]...
8: I think this is the 5th one I've purchased. I'm working on getting one in every room of my house. I really like what features they offer specifily playing music on all Echos and controlling the lights throughout my house.
...[truncated]...
10: Love it! I’ve listened to songs I haven’t heard since childhood! I get the news, weather, information! It’s great!
...[truncated]...
12: I love it! Learning knew things with it eveyday! Still figuring out how everything works but so far it's been easy to use and understand. She does make me laugh at times
...[truncated]...
13: I purchased this for my mother who is having knee problems now, to give her something to do while trying to over come not getting around so fast like she did.She enjoys all the little and big things it can do...Alexa play this song, What time is it and where, and how to cook this and that!
...[truncated]...
17: Really happy with this purchase. Great speaker and easy to set up.
...[truncated]...
18: We have only been using Alexa for a couple of days and are having a lot of fun with our new toy. It like having a new household member! We are trying to learn all the different featues and benefits that come with it.
...[truncated]...
*Suggestion: Consider running `textshape::split_sentence`
====
TIME
====
The following observations contain timestamps:
802
This issue affected the following text:
802: When we first received this product, it was great. However, about a week ago, the device served up a video advertisement around 10:30pm at night and scared myself and my family. If you want to make sure you are protected and don't allow video directly in your home, the spot is not a device that can keep you safe.
*Suggestion: Consider using `replace_time`
===
URL
===
The following observations contain URLs:
1017
This issue affected the following text:
1017: https://www.amazon.com/dp/B073SQYXTW/ref=cm_cr_ryp_prd_ttl_sol_18
*Suggestion: Consider using `replace_url`
We can see that textclean has identified several pre-processing suggestions and solutions:
- replace_contraction() to replace any contractions with their multi-word forms (i.e., wasn’t to was not, i’d to i would, etc.)
- replace_date() with replacement = "" to replace any date with a blank character
- replace_time() with replacement = "" to replace any time with a blank character
- replace_emoji() to replace any emoji (i.e., 👌) with word equivalents
- replace_emoticon() to replace any emoticon (i.e., ;) ) with word equivalents
- replace_hash() to replace any #hashtag with a blank character
- replace_number() to replace any number (including comma-separated numbers) with a blank character
- replace_html() with symbol = FALSE to remove any HTML markup
- replace_incomplete() with replacement = "" to replace incomplete sentence end marks (i.e., …)
- replace_url() with replacement = "" to replace any URL with a blank character
- replace_kern() to remove any added manual space (i.e., The B O M B ! to The BOMB!)
- replace_internet_slang() to replace slang with longer word equivalents (i.e., ASAP to as soon as possible)

Traditionally, we would need to call every function one at a time:
The benefit of using the |> pipe is very apparent in situations like this:
review_data$review <- review_data$review |>
replace_contraction() |>
replace_date(replacement = "") |>
replace_time(replacement = "") |>
replace_email() |>
replace_emoticon() |>
replace_number() |>
replace_html(symbol = FALSE) |>
replace_incomplete(replacement = "") |>
replace_url(replacement = "") |>
replace_kern() |>
  replace_internet_slang()

In addition, we can use str_remove_all() from the stringr package (part of the tidyverse) to remove all matched patterns from a string.
tidytext
Once our text has been cleaned, we will be using tidytext to preprocess the text data.
Text mining or text analysis methods are based on counting:
These segments are called tokens.
Therefore, we need to split the text into these smaller segments.
This process is called tokenization.
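As a toy illustration of what tokenization does (a simplified base-R sketch, not how tidytext implements it), we can lowercase the text, strip punctuation, and split on whitespace:

```r
# Toy tokenizer (illustration only): lowercase, strip punctuation,
# and split the text on whitespace
tokenize <- function(text) {
  text <- tolower(text)
  text <- gsub("[^a-z' ]", " ", text)  # keep letters, apostrophes, spaces
  tokens <- unlist(strsplit(text, "\\s+"))
  tokens[tokens != ""]
}

tokenize("Love my Echo!")
# → "love" "my" "echo"
```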
tidytext
From a tidy text framework, we need to both
- break the text into individual tokens (tokenization), and
- transform it into a tidy data structure.

Tidy text is defined as a one-token-per-row data frame, where a token can be a word, an n-gram, a sentence, or a paragraph.
We can do this by using unnest_tokens() from tidytext.
unnest_tokens() requires at least two arguments:
- the name of the new output column that will hold the tokens (word in our case, for simplicity), and
- the name of the input column that the text comes from (review in our case)

# A tibble: 65,749 × 5
stars date product feedback word
<dbl> <chr> <chr> <dbl> <chr>
1 5 31-Jul-18 Charcoal Fabric 1 love
2 5 31-Jul-18 Charcoal Fabric 1 my
3 5 31-Jul-18 Charcoal Fabric 1 echo
4 5 31-Jul-18 Charcoal Fabric 1 loved
5 5 31-Jul-18 Charcoal Fabric 1 it
6 4 31-Jul-18 Walnut Finish 1 sometimes
7 4 31-Jul-18 Walnut Finish 1 while
8 4 31-Jul-18 Walnut Finish 1 playing
9 4 31-Jul-18 Walnut Finish 1 a
10 4 31-Jul-18 Walnut Finish 1 game
# ℹ 65,739 more rows
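The call that produces a tibble like the one above can be sketched as follows, using a two-row stand-in for review_data (toy data, not the full dataset):

```r
library(dplyr)
library(tidytext)

# Two-row stand-in for the Alexa review data
review_data <- tibble(
  stars  = c(5, 4),
  review = c("Love my Echo!", "Sometimes while playing a game")
)

# unnest_tokens() lowercases, strips punctuation, and drops the
# input column, leaving one word per row
tidy_review <- review_data |>
  unnest_tokens(output = word, input = review)

tidy_review
```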
# A tibble: 4,263 × 2
word n
<chr> <int>
1 the 2675
2 i 2563
3 to 2241
4 it 2205
5 and 1810
6 a 1224
7 is 1202
8 my 1118
9 for 849
10 love 743
# ℹ 4,253 more rows
Stop words are overly common words that may not add any meaning to our results (e.g., “the”, “have”, “is”, “are”).
We want to exclude them from our textual data and our analysis completely.
There is no single universal list of stop words.
Nor any agreed upon rules for identifying stop words!
Luckily, there are several different lists to choose from…
We can get a specific stop word lexicon via the stopwords() function from the stopwords package, in a tidy format with one word per row.
Stop words lists are available in multiple languages too!
[1] "da" "de" "en" "es" "fi" "fr" "hu" "ir" "it" "nl" "no" "pt" "ro" "ru" "sv"
[1] "af" "ar" "hy" "eu" "bn" "br" "bg" "ca" "zh" "hr" "cs" "da" "nl" "en" "eo"
[16] "et" "fi" "fr" "gl" "de" "el" "ha" "he" "hi" "hu" "id" "ga" "it" "ja" "ko"
[31] "ku" "la" "lt" "lv" "ms" "mr" "no" "fa" "pl" "pt" "ro" "ru" "sk" "sl" "so"
[46] "st" "es" "sw" "sv" "th" "tl" "tr" "uk" "ur" "vi" "yo" "zu"
Different word lists contain different words!
# A tibble: 1 × 1
n
<int>
1 1298
We can sample a random list of these stop words.
By default, from smart in English (en).
[1] "again" "hereby" "each" "finds" "between" "works"
[7] "specify" "maybe" "uses" "took" "so" "interest"
[13] "enough" "to" "second"
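For example, retrieving the SMART lexicon looks like this (assuming the stopwords package is installed):

```r
library(stopwords)

# Ask for the SMART lexicon in English
sw_smart <- stopwords(language = "en", source = "smart")

head(sw_smart)
"the" %in% sw_smart
# → TRUE
```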
To remove stop words from our tidy tibble using tidytext, we will use a join.
After we tokenize the reviews into words, we can use anti_join() to remove stop words.
If we want to select another source or another language, we can join using the get_stopwords() function directly.
Notice that stop_words already has a word column.
and a new column called word was created by the unnest_tokens() function,
# A tibble: 65,749 × 5
stars date product feedback word
<dbl> <chr> <chr> <dbl> <chr>
1 5 31-Jul-18 Charcoal Fabric 1 love
2 5 31-Jul-18 Charcoal Fabric 1 my
3 5 31-Jul-18 Charcoal Fabric 1 echo
4 5 31-Jul-18 Charcoal Fabric 1 loved
5 5 31-Jul-18 Charcoal Fabric 1 it
6 4 31-Jul-18 Walnut Finish 1 sometimes
7 4 31-Jul-18 Walnut Finish 1 while
8 4 31-Jul-18 Walnut Finish 1 playing
9 4 31-Jul-18 Walnut Finish 1 a
10 4 31-Jul-18 Walnut Finish 1 game
# ℹ 65,739 more rows
so anti_join() automatically joins on the column word.
# A tibble: 22,144 × 5
stars date product feedback word
<dbl> <chr> <chr> <dbl> <chr>
1 5 31-Jul-18 Charcoal Fabric 1 love
2 5 31-Jul-18 Charcoal Fabric 1 echo
3 5 31-Jul-18 Charcoal Fabric 1 loved
4 4 31-Jul-18 Walnut Finish 1 playing
5 4 31-Jul-18 Walnut Finish 1 game
6 4 31-Jul-18 Walnut Finish 1 answer
7 4 31-Jul-18 Walnut Finish 1 question
8 4 31-Jul-18 Walnut Finish 1 correctly
9 4 31-Jul-18 Walnut Finish 1 alexa
10 4 31-Jul-18 Walnut Finish 1 wrong
# ℹ 22,134 more rows
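The anti-join can be sketched on a toy tibble, using tidytext's built-in stop_words data frame:

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the tokenized reviews
tidy_toy <- tibble(word = c("love", "my", "echo", "the", "sound"))

# anti_join() keeps only the rows with no match in stop_words
tidy_toy_clean <- tidy_toy |>
  anti_join(stop_words, by = "word")

tidy_toy_clean$word
```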
Let’s check the result.
# A tibble: 3,728 × 2
word n
<chr> <int>
1 love 743
2 echo 658
3 alexa 473
4 music 363
5 easy 268
6 sound 237
7 set 231
8 amazon 218
9 dot 211
10 product 205
# ℹ 3,718 more rows
Starting with our tidy text, we want to create an extra column called id to be able to identify the review.
# A tibble: 22,144 × 6
stars date product feedback id word
<dbl> <chr> <chr> <dbl> <int> <chr>
1 5 31-Jul-18 Charcoal Fabric 1 1 love
2 5 31-Jul-18 Charcoal Fabric 1 1 echo
3 5 31-Jul-18 Charcoal Fabric 1 2 loved
4 4 31-Jul-18 Walnut Finish 1 3 playing
5 4 31-Jul-18 Walnut Finish 1 3 game
6 4 31-Jul-18 Walnut Finish 1 3 answer
7 4 31-Jul-18 Walnut Finish 1 3 question
8 4 31-Jul-18 Walnut Finish 1 3 correctly
9 4 31-Jul-18 Walnut Finish 1 3 alexa
10 4 31-Jul-18 Walnut Finish 1 3 wrong
# ℹ 22,134 more rows
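The id column can be created with row_number() from dplyr, assigned before tokenization so that every token inherits its review's id (a sketch on toy data):

```r
library(dplyr)

review_data <- tibble(
  review = c("Love my Echo!", "Loved it!")
)

# One id per review
review_data_id <- review_data |>
  mutate(id = row_number())

review_data_id$id
# → 1 2
```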
Visualizing counts with geom_col()
We can combine using the pipe |> to make it easier to read and more concise!
Too many words? We can filter() before visualizing.
# A tibble: 27 × 2
word n
<chr> <int>
1 love 743
2 echo 658
3 alexa 473
4 music 363
5 easy 268
6 sound 237
7 set 231
8 amazon 218
9 dot 211
10 product 205
# ℹ 17 more rows
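A minimal geom_col() sketch of the bar chart, on a toy count table (the counts mirror the output above):

```r
library(dplyr)
library(ggplot2)

word_counts <- tibble(
  word = c("love", "echo", "alexa"),
  n    = c(743, 658, 473)
)

# geom_col() draws one bar per word, with height n
p <- word_counts |>
  ggplot(aes(x = word, y = n)) +
  geom_col()

# p is a ggplot object; printing it draws the chart
```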
We can do a few tweaks to improve the count visualization.
Again, we can pipe everything together using |> to make it more concise.
Sometimes we discover a number of words in the data that aren’t informative and should be removed from our final list of words.
In this exercise, we will add a few words to our custom_stop_words data frame.
# A tibble: 1,149 × 2
word lexicon
<chr> <chr>
1 a SMART
2 a's SMART
3 able SMART
4 about SMART
5 above SMART
6 according SMART
7 accordingly SMART
8 across SMART
9 actually SMART
10 after SMART
# ℹ 1,139 more rows
For that, we can create a custom tibble/data frame called custom_stop_words.
The column names of the new data frame of custom stop words should match stop_words (i.e., ~word and ~lexicon).
We can now merge both lists into one that we can use for the analysis by using bind_rows()
After that, we can use it with anti_join() to remove all the stop words at once!
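A sketch of building and applying the custom list (the extra words here, e.g. "echo", are illustrative choices, and the lexicon label is arbitrary):

```r
library(dplyr)
library(tidytext)

# Custom words to drop, with columns matching stop_words
custom_stop_words <- tibble(
  word    = c("echo", "alexa"),
  lexicon = "CUSTOM"
)

# Merge with the built-in list, then anti-join as before
all_stop_words <- stop_words |>
  bind_rows(custom_stop_words)

tidy_toy <- tibble(word = c("love", "echo", "alexa", "music"))

tidy_toy |>
  anti_join(all_stop_words, by = "word")
```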
We can combine all the steps together now.
Let’s check if that word is still there…
# A tibble: 0 × 5
# ℹ 5 variables: id <int>, date <chr>, product <chr>, stars <dbl>, word <chr>
We are still able to pipe it all together with |> 🙀
To order the different words (i.e., tokens), we can use the fct_reorder() function (or reorder()) from forcats, also part of tidyverse.
# A tibble: 27 × 3
word n word2
<chr> <int> <fct>
1 alexa 473 alexa
2 amazon 218 amazon
3 bought 163 bought
4 day 127 day
5 device 186 device
6 devices 112 devices
7 dot 211 dot
8 easy 268 easy
9 echo 658 echo
10 excuse 112 excuse
# ℹ 17 more rows
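The reordering step can be sketched on toy counts:

```r
library(dplyr)
library(forcats)

word_counts <- tibble(
  word = c("alexa", "love", "echo"),
  n    = c(473, 743, 658)
)

# fct_reorder() orders the factor levels by increasing n
word_counts <- word_counts |>
  mutate(word2 = fct_reorder(word, n))

levels(word_counts$word2)
# → "alexa" "echo" "love"
```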
That way, we have a better looking bar plot.
Now, by product!
# A tibble: 9,441 × 3
word product n
<chr> <chr> <int>
1 echo Black Plus 128
2 love Black Show 93
3 love Black Spot 93
4 echo Black Show 92
5 alexa Black Plus 89
6 easy Configuration: Fire TV Stick 87
7 love Configuration: Fire TV Stick 84
8 love Black Dot 78
9 love Black Plus 78
10 echo Black Spot 69
# ℹ 9,431 more rows
Better to group_by().
# A tibble: 9,441 × 3
# Groups: product [16]
word product n
<chr> <chr> <int>
1 07 Black Spot 1
2 1 Black 1
3 1 Black Plus 2
4 1 Charcoal Fabric 1
5 1 White Show 1
6 1 White Spot 1
7 10 Black 1
8 10 Black Spot 1
9 10 Heather Gray Fabric 1
10 10 White 1
# ℹ 9,431 more rows
Using slice_max() allows us to select the largest values of a variable.
# A tibble: 232 × 3
# Groups: product [16]
word product n
<chr> <chr> <int>
1 love Black 62
2 echo Black 58
3 refurbished Black 47
4 dot Black 46
5 alexa Black 37
6 bought Black 31
7 music Black 27
8 amazon Black 20
9 product Black 20
10 time Black 20
# ℹ 222 more rows
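On toy data, slice_max() keeps the largest rows per group:

```r
library(dplyr)

counts <- tibble(
  product = c("A", "A", "A", "B", "B"),
  n       = c(5, 3, 1, 9, 2)
)

# Top 2 rows by n within each product group
top2 <- counts |>
  group_by(product) |>
  slice_max(n, n = 2) |>
  ungroup()

top2$n
# → 5 3 9 2
```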
We will use ungroup() to remove the groups.
Followed by fct_reorder()
# A tibble: 232 × 4
word product n word2
<chr> <chr> <int> <fct>
1 love Black 62 love
2 echo Black 58 echo
3 refurbished Black 47 refurbished
4 dot Black 46 dot
5 alexa Black 37 alexa
6 bought Black 31 bought
7 music Black 27 music
8 amazon Black 20 amazon
9 product Black 20 product
10 time Black 20 time
# ℹ 222 more rows
To visualize, we need to use facet_wrap(), which allows us to “split” the graph by a determined facet.
As explained before, we can |> our way across the code:
tidy_review |>
count(word, product) |>
group_by(product) |>
slice_max(n, n = 10) |>
ungroup() |>
mutate(word2 = fct_reorder(word, n)) |>
ggplot(
aes(
x = word2,
y = n,
fill = product
)
) +
geom_col(show.legend = FALSE) +
facet_wrap(~product, scales = "free_y") +
coord_flip() +
labs(
title = "Review Word Counts"
)

A word cloud is a visual representation of text data.
It is often used to visualize free form text.
They are usually composed of single words.
The importance of each tag is shown with font size or color.
There are several alternative packages to generate word clouds in R.
For this workshop, we will use the ggwordcloud package, as it follows ggplot2 syntax.
We will use the previously created word_count_filter containing the words with more than 100 mentions.
That seems to create a rather ugly word cloud.
We can improve it by piping a theme (i.e., theme_minimal()).
So far, all the words are the same size.
To obtain better proportionality, we need to use scale_size_area()
If we want a tighter-knit word cloud with more exaggerated sizes, we can use scale_radius() instead:
ggwordcloud also allows us to change the shape of our word cloud, by using geom_text_wordcloud_area(shape = shape)
Finally, we can apply some colour to our word cloud
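Putting the pieces together, a sketch of a coloured word cloud with ggwordcloud on toy counts (assuming the package is installed):

```r
library(ggplot2)
library(ggwordcloud)

word_counts <- data.frame(
  word = c("love", "echo", "alexa", "music"),
  n    = c(743, 658, 473, 363)
)

# Size and colour both mapped to the word count
p <- ggplot(word_counts, aes(label = word, size = n, color = n)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 20) +
  scale_color_gradient2() +
  theme_minimal()
```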
We can first group by stars using group_by() and then filter() by the desired rating.
set.seed(13)
word_counts_stars <- tidy_review |>
group_by(stars) |>
filter(stars == 1) |>
count(word) |>
filter(n > 5) |>
arrange(desc(n))
word_counts_stars |>
ggplot(
aes(
label = word,
size = n,
color = n
)
) +
geom_text_wordcloud() +
scale_size_area(max_size = 20) +
theme_minimal() +
  scale_color_gradient2()

We can first filter by the desired product using filter()
set.seed(13)
word_counts_product <- tidy_review |>
filter(product == "Charcoal Fabric") |>
count(word) |>
filter(n > 5) |>
arrange(desc(n))
word_counts_product |>
ggplot(
aes(
label = word,
size = n,
color = n
)
) +
geom_text_wordcloud() +
scale_size_area(max_size = 20) +
theme_minimal() +
  scale_color_gradient2()

Through the looking-glass
In the previous chapter, we explored in depth what we mean by the tidy text format and showed how this format can be used to approach questions about word frequency.
This allowed us to analyze which words are used most frequently in documents and to compare documents.
Let’s now address the topic of opinion mining or sentiment analysis.
When human readers approach a text, we use our understanding of the emotional intent of words to infer whether a section of text is positive 👍 or negative 👎 , or perhaps characterized by some other more nuanced emotion like surprise 😲 or confusion 😕.
As we have previously explored, different levels of analysis based on the text are possible:
In addition, more complex documents can also have dates, volumes, chapters, etc.
Word level analysis exposes detailed information and can be used as foundational knowledge for more advanced practices in topic modeling.
Therefore, a way to analyze the sentiment of a text is
This is an often-used approach, and an approach that naturally takes advantage of the tidy tool ecosystem.
There are different methods used for sentiment analysis, including:
In this tutorial, you will use the lexicon-based approach, but I would encourage you to investigate the other methods as well as their associated trade-offs.
Several distinct dictionaries exist to evaluate the opinion or emotion in text.
The tidytext package provides access to several sentiment lexicons, using the get_sentiments() function:
All four of these lexicons are based on unigrams, i.e., single words.
These lexicons contain many English words and the words are assigned scores for positive/negative sentiment, and also possibly emotions like joy, anger, sadness, and so forth.
Dictionary-based methods find the total sentiment of a piece of text by adding up the individual sentiment scores for each word in the text.
Not every English word is present in the lexicons because many English words are pretty neutral.
These methods do not take into account qualifiers before a word, such as in “no good” or “not true”; a lexicon-based method like this is based on unigrams only.
For many kinds of text (like the example in this workshop), there are not sustained sections of sarcasm or negated text, so this is not an important effect.
Also, we can use a tidy text approach to begin to understand what kinds of negation words are important in a given text.
The size of the chunk of text that we use to add up unigram sentiment scores can have an effect on an analysis.
A text the size of many paragraphs can often have positive and negative sentiment averaged out to about zero, while sentence-sized or paragraph-sized text often works better.
An example of sentence-based analysis using the sentimentr package is included in the Appendix (for those who are impatient!).
The AFINN lexicon (Nielsen 2011) can be loaded by using the get_sentiments() function.
# A tibble: 2,477 × 2
word value
<chr> <dbl>
1 abandon -2
2 abandoned -2
3 abandons -2
4 abducted -2
5 abduction -2
6 abductions -2
7 abhor -3
8 abhorred -3
9 abhorrent -3
10 abhors -3
# ℹ 2,467 more rows
The AFINN lexicon (Nielsen 2011) assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
# A tibble: 1 × 2
min max
<dbl> <dbl>
1 -5 5
The bing lexicon (Hu and Liu 2004) categorizes words in a binary fashion into positive and negative categories.
# A tibble: 2 × 2
sentiment n
<chr> <int>
1 negative 4781
2 positive 2005
The nrc lexicon (Mohammad and Turney 2013) categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
# A tibble: 10 × 2
sentiment n
<chr> <int>
1 negative 3316
2 positive 2308
3 fear 1474
4 anger 1245
5 trust 1230
6 sadness 1187
7 disgust 1056
8 anticipation 837
9 joy 687
10 surprise 532
The Loughran lexicon (Loughran and McDonald 2011) was created for use with financial documents, and labels words with six possible sentiments important in financial contexts: “negative”, “positive”, “litigious”, “uncertainty”, “constraining”, or “superfluous”.
Dictionaries need to be joined to our tidy data by using inner_join().
inner_join() drops any row that does not have a match in both data sets.
# A tibble: 11,000 × 6
id date product stars word sentiment
<int> <chr> <chr> <dbl> <chr> <chr>
1 1 31-Jul-18 Charcoal Fabric 5 love joy
2 1 31-Jul-18 Charcoal Fabric 5 love positive
3 3 31-Jul-18 Walnut Finish 4 question positive
4 3 31-Jul-18 Walnut Finish 4 wrong negative
5 4 31-Jul-18 Charcoal Fabric 5 fun anticipation
6 4 31-Jul-18 Charcoal Fabric 5 fun joy
7 4 31-Jul-18 Charcoal Fabric 5 fun positive
8 4 31-Jul-18 Charcoal Fabric 5 music joy
9 4 31-Jul-18 Charcoal Fabric 5 music positive
10 4 31-Jul-18 Charcoal Fabric 5 music sadness
# ℹ 10,990 more rows
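The join can be sketched on toy tokens, here with the bing lexicon, which ships with tidytext (the output above used nrc):

```r
library(dplyr)
library(tidytext)

tidy_toy <- tibble(word = c("love", "echo", "wrong", "fun"))

# "echo" has no entry in the lexicon, so inner_join() drops its row
toy_sentiment <- tidy_toy |>
  inner_join(get_sentiments("bing"), by = "word")

toy_sentiment
```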
After that, we can count the sentiments.
# A tibble: 10 × 2
sentiment n
<chr> <int>
1 anger 343
2 anticipation 1275
3 disgust 209
4 fear 446
5 joy 2118
6 negative 928
7 positive 3386
8 sadness 723
9 surprise 477
10 trust 1095
We can also count how many words are linked to which sentiment.
# A tibble: 1,500 × 3
word sentiment n
<chr> <chr> <int>
1 love joy 743
2 love positive 743
3 music joy 363
4 music positive 363
5 music sadness 363
6 time anticipation 154
7 prime positive 133
8 excuse negative 112
9 fun anticipation 107
10 fun joy 107
# ℹ 1,490 more rows
We will focus only on positive and negative sentiments.
Of course, we can tidy and |> all that code 😏
tidy_review |>
inner_join(get_sentiments("nrc")) |>
filter(sentiment %in% c("positive", "negative")) |>
count(word, sentiment) |>
group_by(sentiment) |>
slice_max(n, n = 10) |>
ungroup() |>
mutate(word2 = fct_reorder(word, n)) |>
ggplot(
aes(
x = word2,
y = n,
fill = sentiment
)
) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free") +
coord_flip() +
labs(
title = "Sentiment Word Counts (nrc lexicon)",
x = "Words"
)

Let’s use the bing lexicon for this experiment.
# A tibble: 10 × 3
stars sentiment n
<dbl> <chr> <int>
1 1 negative 160
2 1 positive 84
3 2 negative 114
4 2 positive 74
5 3 negative 115
6 3 positive 91
7 4 negative 253
8 4 positive 427
9 5 negative 461
10 5 positive 2263
For a more comfortable exploration, we may want to transpose the results.
That can be achieved with the pivot_wider() function (from tidyr), which will transform data from long to wide format.
# A tibble: 5 × 3
stars negative positive
<dbl> <int> <int>
1 1 160 84
2 2 114 74
3 3 115 91
4 4 253 427
5 5 461 2263
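The reshape can be sketched on toy counts:

```r
library(dplyr)
library(tidyr)

sentiment_long <- tibble(
  stars     = c(1, 1, 5, 5),
  sentiment = c("negative", "positive", "negative", "positive"),
  n         = c(160, 84, 461, 2263)
)

# One row per star rating, one column per sentiment
sentiment_wide <- sentiment_long |>
  pivot_wider(names_from = sentiment, values_from = n)

sentiment_wide
```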
After that, we can use mutate() to create a new column with the overall sentiment rating.
# A tibble: 5 × 4
stars negative positive overall_sentiment
<dbl> <int> <int> <int>
1 1 160 84 -76
2 2 114 74 -40
3 3 115 91 -24
4 4 253 427 174
5 5 461 2263 1802
We can put it all together to obtain a visualization 🎉
tidy_review |>
inner_join(get_sentiments("bing")) |>
count(stars, sentiment) |>
pivot_wider(names_from = sentiment, values_from = n) |>
mutate(
overall_sentiment = positive - negative,
stars2 = reorder(stars, overall_sentiment)
) |>
ggplot(
aes(
x = stars2,
y = overall_sentiment,
fill = as.factor(stars)
)
) +
geom_col(show.legend = FALSE) +
coord_flip() +
labs(
title = "Overall Sentiment by Star rating (bing lexicon)",
subtitle = "Reviews for Alexa",
x = "Stars",
y = "Overall Sentiment"
)

One advantage of having the data frame with both sentiment and word is that we can analyze word counts that contribute to each sentiment.
By implementing count() here with arguments of both word and sentiment, we find out how much each word contributed to each sentiment.
# A tibble: 585 × 3
word sentiment n
<chr> <chr> <int>
1 love positive 743
2 easy positive 268
3 smart positive 143
4 excuse negative 112
5 fun positive 107
6 alarm negative 97
7 nice positive 77
8 awesome positive 66
9 perfect positive 66
10 amazing positive 63
# ℹ 575 more rows
This can be shown visually, and we can pipe straight into ggplot2, if we like, because of the way we are consistently using tools built for handling tidy data frames.
We can do the same, but slicing the data by the star rating given by the consumers using group_by()
# A tibble: 1,001 × 4
stars word sentiment n
<dbl> <chr> <chr> <int>
1 5 love positive 658
2 5 easy positive 232
3 5 smart positive 109
4 5 fun positive 86
5 4 love positive 72
6 5 excuse negative 69
7 5 amazing positive 57
8 5 alarm negative 56
9 5 awesome positive 56
10 5 perfect positive 56
# ℹ 991 more rows
We can focus on 1 star rating using filter()
bing_word_counts_by_stars |>
filter(stars == 1) |>
group_by(sentiment) |>
slice_max(n, n = 10) |>
ungroup() |>
mutate(word = reorder(word, n)) |>
ggplot(
aes(
x = n,
y = word,
fill = sentiment
)
) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(
x = "Contribution to sentiment for 1 star reviews",
y = NULL
)

Let’s now look at the 5 star reviews.
bing_word_counts_by_stars |>
filter(stars == 5) |>
group_by(sentiment) |>
slice_max(n, n = 10) |>
ungroup() |>
mutate(word = reorder(word, n)) |>
ggplot(
aes(
x = n,
y = word,
fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(
x = "Contribution to sentiment for 5 star reviews",
y = NULL
)

Let’s compare side by side.
Sometimes we want to visually present positive and negative words for the same text.
We can use the comparison.cloud() function from the wordcloud package.
For comparison.cloud(), we may need to turn the data frame into a matrix with the acast() function from the reshape2 package.
The size of a word’s text is in proportion to its frequency within its sentiment.
We can see the most important positive and negative words, but the sizes of the words are not comparable across sentiments.
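The reshape to a matrix can be sketched on toy counts:

```r
library(reshape2)

bing_word_counts <- data.frame(
  word      = c("love", "easy", "excuse", "alarm"),
  sentiment = c("positive", "positive", "negative", "negative"),
  n         = c(743, 268, 112, 97)
)

# Rows = words, columns = sentiments; missing combinations filled with 0
m <- acast(bing_word_counts, word ~ sentiment, value.var = "n", fill = 0)
m
```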
We can do the same as before, focusing on the different star ratings.
1 star review
5 star review
Into the woods!
In this last part, we will build a model using Latent Dirichlet Allocation (LDA), to give a simple example of using LDA to generate collections of words that together suggest themes.
Clustering
Topic Modelling
Topic modelling is an unsupervised machine learning approach that can:
Latent Dirichlet Allocation (LDA) is a machine learning algorithm which discovers different topics underlying a collection of documents, where each document is a collection of words.
LDA makes the following two assumptions:
1. Every document is a combination of one or more topics
2. Every topic is a mixture of words
LDA seeks to find groups of related words.
It is an iterative, generative algorithm, with two main steps:
The LDA algorithm requires the data to be presented as a document-term matrix (DTM).
Each document is a row, and each column is a term.
We can achieve that by piping (|>) our tidy data to the cast_dtm() function from tidytext, where:
- id is the name of the field with the document name, and
- word is the name of the field with the term.

This tells us how many documents and terms we have, and that this is a very sparse matrix.
The word sparse implies that the DTM contains mostly empty fields.
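The cast can be sketched on a toy tidy tibble (note that cast_dtm() relies on the tm package being installed):

```r
library(dplyr)
library(tidytext)

tidy_toy <- tibble(
  id   = c(1, 1, 2, 2),
  word = c("love", "echo", "music", "echo")
)

# Count words per document, then cast to a document-term matrix
dtm_toy <- tidy_toy |>
  count(id, word) |>
  cast_dtm(document = id, term = word, value = n)

dtm_toy
# a 2 (documents) x 3 (terms) DocumentTermMatrix
```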
We can look into the contents of a few rows and columns of the DTM by piping it into the as.matrix() function.
We will use the LDA() function from the topicmodels package.
For our purposes, we will just need to know three parameters for the LDA() function:
- the number of topics, k (let’s start with two, so k = 2),
- the sampling method (method =), and
- the seed for reproducibility (seed = 123).

The method parameter defines the sampling algorithm to use.
The default is method = "VEM".
We will use the method = "Gibbs" sampling method (in my experience, it performs better).
An explanation of VEM or Gibbs methods is beyond this workshop, but I encourage everyone to read a bit more about these two methods.
Let’s fit the LDA model and explore the output.
A LDA_Gibbs topic model with 2 topics.
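A runnable toy version of the fit (assuming topicmodels is installed; real topics of course need far more data than this):

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Tiny toy corpus: two "music" documents and two "smart home" documents
tidy_toy <- tibble(
  id   = rep(1:4, each = 3),
  word = c("music", "sound", "speaker",
           "music", "speaker", "volume",
           "lights", "smart", "home",
           "smart", "home", "lights")
)

dtm_toy <- tidy_toy |>
  count(id, word) |>
  cast_dtm(id, word, n)

# Fit a 2-topic LDA model with Gibbs sampling and a fixed seed
lda_toy <- LDA(dtm_toy, k = 2, method = "Gibbs", control = list(seed = 123))
lda_toy
```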
LDA() output
If you REALLY want more details, we can use the glimpse() function:
Formal class 'LDA_Gibbs' [package "topicmodels"] with 16 slots
..@ seedwords : NULL
..@ z : int [1:22139] 2 2 1 1 1 1 1 1 2 2 ...
..@ alpha : num 25
..@ call : language LDA(x = dtm_review, k = 2, method = "Gibbs", control = list(seed = 123))
..@ Dim : int [1:2] 2330 3725
..@ control :Formal class 'LDA_Gibbscontrol' [package "topicmodels"] with 14 slots
..@ k : int 2
..@ terms : chr [1:3725] "07" "1" "10" "10.00" ...
..@ documents : chr [1:2330] "946" "90" "369" "864" ...
..@ beta : num [1:2, 1:3725] -11.65 -9.25 -7.54 -11.64 -11.65 ...
..@ gamma : num [1:2330, 1:2] 0.516 0.517 0.475 0.398 0.5 ...
..@ wordassignments:List of 5
.. ..$ i : int [1:20224] 1 1 1 1 1 1 1 1 1 1 ...
.. ..$ j : int [1:20224] 1 12 23 144 774 1669 1794 2119 2469 2630 ...
.. ..$ v : num [1:20224] 2 2 1 1 1 1 1 2 2 2 ...
.. ..$ nrow: int 2330
.. ..$ ncol: int 3725
.. ..- attr(*, "class")= chr "simple_triplet_matrix"
..@ loglikelihood : num -145622
..@ iter : int 2000
..@ logLiks : num(0)
..@ n : int 22139
Most easily, we can use the tidy() function with the matrix = "beta" argument to put it into a format that is easy to understand.
Passing beta provides us with the per-topic-per-word probabilities from the model.
To understand the model clearly, we need to see what terms are in each topic.
beta_wide <- lda_2_topics |>
mutate(topic = paste0("topic", topic)) |>
pivot_wider(names_from = topic, values_from = beta) |>
filter(topic1 > .001 | topic2 > .001) |>
mutate(log_ratio = log2(topic2 / topic1))
beta_wide |>
arrange(desc(abs(log_ratio))) |>
head(20) |>
arrange(desc(log_ratio)) |>
ggplot(
aes(
x = log_ratio,
y = term
)
) +
geom_col(show.legend = FALSE) +
labs(
title = "Terms with the greatest difference in beta between two topics"
)

Don’t forget about the mixed-membership concept and that these topics are not meant to be completely disjoint.
You can fine tune the LDA algorithm using extra parameters to improve the model.
LDA models each document as a mix of topics and words.
With matrix = "gamma" we can investigate per-document-per-topic probabilities.
Each of these values represents an estimated percentage of the document’s words that are from each topic. Most of these reviews belong to more than one topic.
# A tibble: 4,660 × 3
document topic gamma
<chr> <int> <dbl>
1 1666 1 0.728
2 1466 1 0.716
3 1496 1 0.688
4 1646 1 0.683
5 1688 1 0.677
6 1601 1 0.671
7 1658 1 0.670
8 1624 1 0.655
9 1464 1 0.651
10 1038 2 0.648
# ℹ 4,650 more rows
We covered several topics:
- Tidy data and the |> pipe
- textclean
- tidytext

We explored:
We discussed:
Thank you! 🙌
Feel free to explore the Appendix!
sentimentr
Another package for lexicon-based sentiment analysis is sentimentr (Rinker 2021).
Unlike the tidytext package, sentimentr takes valence shifters (e.g., negation) into account, which can easily flip the polarity of a sentence with one word.
sentimentr
In contrast to tidytext, for sentimentr we need the actual sentences rather than the individual tokens.
Therefore, we can use the original cleaned review_data, get individual sentences for each review using the get_sentences() function, and then calculate sentiment scores per sentence via sentiment().
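A two-sentence sketch (assuming sentimentr is installed; note how the negator flips the polarity of "like"):

```r
library(sentimentr)

# Split raw text into sentences, then score each one
sentences <- get_sentences("Love my Echo! I do not like the sound.")
scores <- sentiment(sentences)

# One row per sentence: the first is positive, the second negative
scores$sentiment
```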
stars date product
1: 5 31-Jul-18 Charcoal Fabric
2: 5 31-Jul-18 Charcoal Fabric
3: 4 31-Jul-18 Walnut Finish
4: 4 31-Jul-18 Walnut Finish
5: 5 31-Jul-18 Charcoal Fabric
---
5834: 5 30-Jul-18 White Dot
5835: 4 29-Jul-18 Black Dot
5836: 5 29-Jul-18 Black Dot
5837: 5 31-Jul-18 Black Dot
5838: 5 31-Jul-18 Black Dot
review
1: Love my Echo!
2: Loved it!
3: Sometimes while playing a game, you can answer a question correctly but Alexa says you got it wrong and answers the same as you.
4: I like being able to turn lights on and off while away from home.
5: I have had a lot of fun with this thing.
---
5834: I have a couple friends that have a dot and do not mind the audio quality, but if you are bothered by that kind of thing I would go with the full size echo or make sure you hook the do up to some larger speakers.
5835: Good
5836: Nice little unit no issues
5837: The echo dot was easy to set up and use.
5838: It helps provide music, etc. to small spaces and was just what I was looking for.
feedback id element_id sentence_id word_count sentiment
1: 1 1 1 1 3 0.43301270
2: 1 2 2 1 2 0.35355339
3: 1 3 3 1 24 -0.34429620
4: 1 3 3 2 14 0.13363062
5: 1 4 4 1 10 0.23717082
---
5834: 1 2432 2432 3 46 -0.31331416
5835: 1 2433 2433 1 1 0.75000000
5836: 1 2434 2434 1 5 0.13416408
5837: 1 2435 2435 1 10 -0.06324555
5838: 1 2435 2435 2 16 0.55000000
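The per-sentence scores shown above can be produced with a short pipeline; a minimal sketch, assuming the cleaned review_data from the earlier parts:

```r
library(sentimentr)

# Split each review into its sentences, then score each sentence;
# element_id identifies the review, sentence_id the sentence within it
sentence_scores <- review_data |>
  get_sentences() |>
  sentiment()
```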
sentimentr - Plotting by Star Rating
We can also look at sentiment analysis by whole reviews, instead of per sentence.
sentimentr_sentence <- review_data |>
  get_sentences() |>
  sentiment_by()

review_data_id <- review_data |>
  mutate(id = row_number())

sentimentr_merged <- sentimentr_sentence |>
  inner_join(review_data_id,
             join_by(element_id == id))

sentimentr_merged |>
  ggplot(
    aes(
      x = stars,
      y = ave_sentiment,
      fill = as.factor(stars)
    )
  ) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Overall Sentiment by Stars using sentimentr",
    subtitle = "Reviews for Alexa",
    x = "Stars",
    y = "Overall Sentiment"
  )
For this specific case, we can see that the sentiment analysis results are very similar between:
sentimentr at a sentence level
bing lexicon on a word-by-word basis
Unfortunately, there is no tidy way to create a word cloud with the wordcloud package 😒
Regardless, it is time to ask the wordcloud() function to read and plot our data.
There are some useful arguments to experiment with here:
min.freq and max.words set boundaries for how populated the word cloud will be
random.order will put the largest words in the middle if set to FALSE
rot.per is the fraction of words that will be rotated in the graphic
Finally, the words are arranged with some randomness, so for a repeatable graphic we need to specify a seed value with set.seed().
Changing the number of words in the cloud.
Using pre-defined colours.
Using funky colours, thanks to the RColorBrewer package and its large selection of colour palettes.
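Putting those arguments together, a sketch of a wordcloud() call, assuming a word-frequency table called review_words with columns word and n (as built in Part One):

```r
library(wordcloud)
library(RColorBrewer)

set.seed(1234)  # for a reproducible layout

wordcloud(
  words = review_words$word,
  freq = review_words$n,
  min.freq = 5,          # drop rare words
  max.words = 100,       # cap how populated the cloud gets
  random.order = FALSE,  # largest words in the middle
  rot.per = 0.25,        # fraction of rotated words
  colors = brewer.pal(8, "Dark2")
)
```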
If you need more customization (including non-latin characters), you can use the wordcloud2() function from the wordcloud2 package.
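wordcloud2() takes a data frame whose first two columns are the words and their frequencies; a minimal sketch, again assuming a review_words table with columns word and n:

```r
library(wordcloud2)

review_words |>
  wordcloud2(size = 0.8)
```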